Hòa Bình Province
Fake Advertisements Detection Using Automated Multimodal Learning: A Case Study for Vietnamese Real Estate Data
Nguyen, Duy, Nguyen, Trung T., Nguyen, Cuong V.
The popularity of e-commerce has given rise to fake advertisements that can expose users to financial and data risks while damaging the reputation of e-commerce platforms. For these reasons, detecting and removing such fake advertisements is important for the success of e-commerce websites. In this paper, we propose FADAML, a novel end-to-end machine learning system to detect and filter out fake online advertisements. Our system combines techniques in multimodal machine learning and automated machine learning to achieve a high detection rate. As a case study, we apply FADAML to detect fake advertisements on popular Vietnamese real estate websites. Our experiments show that we can achieve 91.5% detection accuracy, which significantly outperforms three different state-of-the-art fake news detection systems.
- Asia > Vietnam > Hồ Chí Minh City > Hồ Chí Minh City (0.04)
- North America > United States > New York (0.04)
- North America > Canada > Ontario > Toronto (0.04)
- Marketing (1.00)
- Banking & Finance > Real Estate (1.00)
- Information Technology > Services > e-Commerce Services (0.95)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.69)
- Information Technology > Artificial Intelligence > Machine Learning > Ensemble Learning (0.69)
Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges
Van Dinh, Nguyen, Dang, Thanh Chi, Nguyen, Luan Thanh, Van Nguyen, Kiet
Vietnamese, a low-resource language, is typically categorized into three primary dialect groups that belong to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Despite the existence of various speech recognition datasets, none of them has provided a fine-grained classification of the 63 dialects specific to individual provinces of Vietnam. To address this gap, we introduce the Vietnamese Multi-Dialect (ViMD) dataset, a novel comprehensive dataset capturing the rich diversity of 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio, consisting of approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and simultaneously demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained models for two downstream tasks: (1) dialect identification and (2) speech recognition. The empirical results suggest two implications: geographical factors influence dialects, and current approaches are constrained in speech recognition tasks involving multi-dialect speech data. Our dataset is available for research purposes.
- Asia > Vietnam > Hanoi > Hanoi (0.14)
- Asia > Vietnam > Thanh Hóa Province > Thanh Hóa (0.04)
- Asia > Vietnam > Hưng Yên Province > Hưng Yên (0.04)
VietMed: A Dataset and Benchmark for Automatic Speech Recognition of Vietnamese in the Medical Domain
In this work, we present VietMed, a Vietnamese speech recognition dataset in the medical domain comprising 16h of labeled medical speech, 1000h of unlabeled medical speech and 1200h of unlabeled general-domain speech. To the best of our knowledge, VietMed is by far the world's largest public medical speech recognition dataset in 7 aspects: total duration, number of speakers, diseases, recording conditions, speaker roles, unique medical terms and accents. VietMed is also by far the largest public Vietnamese speech dataset in terms of total duration. Additionally, we are the first to present a medical ASR dataset covering all ICD-10 disease groups and all accents within a country. Moreover, we release the first public large-scale pre-trained models for Vietnamese ASR, w2v2-Viet and XLSR-53-Viet, along with the first public large-scale fine-tuned models for medical ASR. Even without any medical data in unsupervised pre-training, our best pre-trained model XLSR-53-Viet generalizes very well to the medical domain, outperforming the state-of-the-art XLSR-53 by reducing WER on the test set from 51.8% to 29.6% (a relative reduction of more than 40%). All code, data and models are made publicly available.
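The "relative reduction of more than 40%" quoted above follows directly from the two WER figures. A minimal sketch of that arithmetic is shown below; the function name `relative_reduction` is a hypothetical helper introduced only for illustration and is not part of the VietMed release.

```python
def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# WER figures quoted in the abstract, in percent.
baseline = 51.8  # XLSR-53
improved = 29.6  # XLSR-53-Viet

reduction = relative_reduction(baseline, improved)
print(f"absolute: {baseline - improved:.1f} points, relative: {reduction:.1%}")
# prints "absolute: 22.2 points, relative: 42.9%"
```

The absolute drop is 22.2 percentage points, which is about 42.9% of the baseline error rate and consistent with the "more than 40%" stated in the abstract.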
- North America > United States (0.14)
- Europe > Germany (0.14)
- North America > Canada > Ontario > Toronto (0.14)
- Information Technology > Security & Privacy (1.00)
- Health & Medicine > Therapeutic Area (1.00)
- Government (1.00)
Sandwich attack: Multi-language Mixture Adaptive Attack on LLMs
Upadhayay, Bibek, Behzadan, Vahid
Large Language Models (LLMs) are increasingly being developed and applied, but their widespread use faces challenges. These include aligning LLMs' responses with human values to prevent harmful outputs, which is addressed through safety training methods. Even so, bad actors and malicious users have succeeded in attempts to manipulate the LLMs to generate misaligned responses for harmful questions such as methods to create a bomb in school labs, recipes for harmful drugs, and ways to evade privacy rights. Another challenge is the multilingual capabilities of LLMs, which enable the model to understand and respond in multiple languages. Attackers exploit the unbalanced pre-training datasets of LLMs across languages and the comparatively lower model performance in low-resource languages than in high-resource ones. As a result, attackers use low-resource languages to intentionally manipulate the model into creating harmful responses. Many similar attack vectors have been patched by model providers, making the LLMs more robust against language-based manipulation. In this paper, we introduce a new black-box attack vector called the Sandwich attack: a multi-language mixture attack, which manipulates state-of-the-art LLMs into generating harmful and misaligned responses. Experiments with models including GPT-4 and Claude-3-OPUS show that this attack vector can be used by adversaries to generate harmful responses and elicit misaligned responses from these models. By detailing both the mechanism and impact of the Sandwich attack, this paper aims to guide future research and development towards more secure and resilient LLMs, ensuring they serve the public good while minimizing potential for misuse. Content Warning: This paper contains examples of harmful language.
Ethics and Disclosure: This paper introduces a new universal attack method for the SOTA LLMs that could potentially be used to elicit harmful content from publicly available LLMs. The adversarial attack method we used in this paper is easy to design and low-cost to implement. Despite the associated risks, we firmly believe that sharing the full details of this research and its methodology will be invaluable to other researchers, scholars, and model creators. It encourages them to delve into the root causes behind these attacks and devise ways to fortify and patch existing models. Additionally, it promotes cooperative initiatives centered around the safety of LLMs in multilingual scenarios.
- Asia > Vietnam > Hòa Bình Province > Hòa Bình (0.04)
- North America > United States > Connecticut > New Haven County > New Haven (0.04)
- Asia > South Korea (0.04)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Law Enforcement & Public Safety (0.93)
- Government (0.88)
Location Agnostic Adaptive Rain Precipitation Prediction using Deep Learning
Islam, Md Shazid, Rahman, Md Saydur, Haque, Md Saad Ul, Tumpa, Farhana Akter, Hossain, Md Sanzid Bin, Arabi, Abul Al
Rain precipitation prediction is a challenging task as it depends on weather and meteorological features which vary from location to location. As a result, a prediction model that performs well at one location does not perform well at other locations due to distribution shifts. In addition, due to global warming, weather patterns are changing rapidly from year to year, so those models may become ineffective even at the same location as time passes. In our work, we propose an adaptive deep learning-based framework to address these challenges. Our method can generalize the model for the prediction of precipitation for any location where the methods without adaptation fail. Our method has shown 43.51%, 5.09%, and 38.62% improvement after adaptation using a deep neural network for predicting the precipitation of Paris, Los Angeles, and Tokyo, respectively.
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.26)
- North America > United States > California > Los Angeles County > Los Angeles (0.26)
- Asia > Bangladesh > Dhaka Division > Dhaka District > Dhaka (0.06)
Smart Home Goal Feature Model -- A guide to support Smart Homes for Ageing in Place
Logothetis, Irini, Rani, Priya, Sivasothy, Shangeetha, Vasa, Rajesh, Mouzakis, Kon
Smart technologies are significant in supporting ageing in place for the elderly. Leveraging Artificial Intelligence (AI) and Machine Learning (ML), they provide peace of mind, enabling the elderly to continue living independently. The elderly use smart technologies for entertainment and social interactions; this use can be extended to provide safety, monitor health and environmental conditions, detect emergencies and notify informal and formal caregivers when care is needed. This paper provides an overview of the smart home technologies commercially available to support ageing in place, the advantages and challenges of smart home technologies, and their usability from the elderly's perspective. Synthesizing prior knowledge, we created a structured Smart Home Goal Feature Model (SHGFM) to resolve the heuristic approaches used by the Subject Matter Experts (SMEs) at aged care facilities and healthcare researchers in adapting smart homes. The SHGFM provides SMEs the ability to (i) establish goals and (ii) identify features to set up strategies to design, develop and deploy smart homes for the elderly based on personalised needs. Our model provides guidance to healthcare researchers and aged care industries to set up smart homes based on the needs of the elderly, by defining a set of goals at different levels mapped to a different set of features.
- Oceania > Australia (0.14)
- North America > United States > New York (0.04)
- Europe > Switzerland (0.04)
- Research Report (0.64)
- Overview (0.54)
VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models
Dao, Xuan-Quy, Le, Ngoc-Bich, Vo, The-Duy, Phan, Xuan-Dung, Ngo, Bac-Bien, Nguyen, Van-Tien, Nguyen, Thi-My-Thanh, Nguyen, Hong-Phuoc
The VNHSGE (VietNamese High School Graduation Examination) dataset, developed exclusively for evaluating large language models (LLMs), is introduced in this article. The dataset, which covers nine subjects, was generated from the Vietnamese National High School Graduation Examination and comparable tests. It includes 300 literary essays and over 19,000 multiple-choice questions on a range of topics. The dataset assesses LLMs in multitasking situations such as question answering, text generation, reading comprehension, visual question answering, and more by including both textual data and accompanying images. Using ChatGPT and BingChat, we evaluated LLMs on the VNHSGE dataset and compared their performance with that of Vietnamese students. The results show that ChatGPT and BingChat both perform at a human level in a number of areas, including literature, English, history, geography, and civics education. They still have room to improve, though, especially in the areas of mathematics, physics, chemistry, and biology. The VNHSGE dataset seeks to provide an adequate benchmark for assessing the abilities of LLMs with its wide-ranging coverage and variety of activities. We intend to promote future developments in the creation of LLMs by making this dataset available to the scientific community, especially in resolving LLMs' limits in disciplines involving mathematics and the natural sciences.
- North America > United States (1.00)
- Europe > Russia (0.14)
- Asia > Russia (0.14)
- Instructional Material (0.92)
- Research Report > New Finding (0.47)
- Education > Curriculum (1.00)
- Government > Regional Government > North America Government > United States Government (0.92)
- Education > Educational Setting > K-12 Education > Secondary School (0.91)